Iterative machine learning-based chemical similarity search to identify novel chemical inhibitors

Machine learning-based chemical screening has made substantial progress in recent years. However, its predictions often have low accuracy and high uncertainty when identifying new active chemical scaffolds. Hence, a high proportion of retrieved compounds are not structurally novel. In this study, we propose a strategy to address this issue by iteratively optimizing an evolutionary chemical binding similarity (ECBS) model using experimental validation data. Various data update and model retraining schemes were tested to efficiently incorporate new experimental data into ECBS models, resulting in a fine-tuned ECBS model with improved accuracy and coverage. To demonstrate the effectiveness of our approach, we identified novel hit molecules for mitogen-activated protein kinase kinase 1 (MEK1). These molecules showed sub-micromolar to low-micromolar affinity (Kd 0.1–5.3 μM) to MEKs and were distinct from previously known MEK1 inhibitors. We also determined their binding specificity for different MEK isoforms and proposed potential docking models. Furthermore, using de novo drug design tools, we used one of the new MEK inhibitors as a starting point to generate additional drug-like molecules, which yielded several potential MEK1 inhibitors with better binding affinity scores. Our results demonstrate the potential of this approach for identifying novel hit molecules and optimizing their binding affinities.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00760-6.


Figure S1. Duplicate dose-response curves used to determine the Kd values of the chemical compounds are shown for MEK1, MEK2, and MEK5. The X-axis represents ligand concentration (nM) and the Y-axis represents relative inhibitory activity measured by the KdELECT service. A maximum concentration of 10 µM was used for all test compounds as the threshold for selecting active compounds. Dose-response curves marked with a star (*) indicate cases in which a concentration higher than 10 µM is required to reach a plateau and determine Kd more accurately.

# AUC values for WEE1 are not available because no newly identified active compounds were found in the initial screening results.
Table S2. The chemical compounds are labeled as Pnew (new active), Pprv (previous active), Nnew (new inactive), and Nprv (previous random inactive data). Among PP, NP, and NN, the NN data set is generally the largest because inactive compounds are usually much more abundant than active compounds after experimental validation. In addition, to train the ECBS models, the parameter defining the number of random compounds is set to four times the number of active compounds.
This parameter might be reduced when fine-tuning the model performance; empirically, however, including more random data is preferred because it represents more diverse inactive compounds.
The data size and ratio were estimated under two simplifying assumptions: 1) the number of new active compounds (Pnew) is very small and the majority of the experimental data are inactive, and 2) the random compounds (Nprv) are sampled at four times the number of previous active compounds (Pprv).

Figure S2. Structurally similar molecules were identified via substructure search in Reaxys.

Figure S4. Two-dimensional interaction diagram of previously reported MEK1 inhibitors.

Figure S7. The molecules selected based on binding free energy scores from either MM/GBSA or ...

Comparison of the prediction performance of the standard single chemical-based Random Forest model with the ECBS model trained with PP-NP-NN data. The chemical screening performance of the ECBS model retrained with PP-NP-NN data (ECBS) is compared with that of a typical Random Forest (RF) model based on individual compounds (Single). The typical RF model was newly built with the same training data used to train the ECBS model, but based on single chemical structures. Thus, it is a binary classification model trained on individual compounds instead of chemical pairs. All training and test data were identical for the two models, although the underlying data format differed: chemical pairs vs. individual chemicals. To compare its prediction accuracy with that of the ECBS model, we designed a simple scoring scheme to rank individual compounds by ECBS scores: for each compound in the test set, the maximum ECBS score obtained against the known active compounds was taken as its final score. The test data were divided into three categories: 1) Known Actives: test only for the original data; 2) New Exp. Data: test only for the new experimental data used to retrain the ECBS model; and 3) All: test for both. The AUC PR values were used to estimate the prediction performance. Chemical compounds with POC lower than 20% and higher than 80% are defined as active and inactive compounds, respectively.
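The maximum-score ranking scheme described above can be illustrated with a minimal sketch. The function names, the toy pair-score table, and the use of average precision as an estimator of the area under the precision-recall curve are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def max_pair_score(
    query: str,
    known_actives: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> float:
    """Final score of a test compound: the maximum pairwise similarity
    score obtained against any known active compound."""
    return max(pair_score(query, active) for active in known_actives)


def average_precision(labeled_scores: List[Tuple[int, float]]) -> float:
    """Average precision over (label, score) tuples, with label 1 = active.
    Used here as one common estimator of the PR-curve area (an assumption)."""
    ranked = sorted(labeled_scores, key=lambda item: item[1], reverse=True)
    n_pos = sum(label for label, _ in ranked)
    if n_pos == 0:
        raise ValueError("at least one active compound is required")
    hits = 0
    total = 0.0
    for rank, (label, _) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recovered active
    return total / n_pos


# Hypothetical pairwise scores between test compounds (Q*) and known actives (A*).
toy_scores: Dict[Tuple[str, str], float] = {
    ("Q1", "A1"): 0.2, ("Q1", "A2"): 0.9,
    ("Q2", "A1"): 0.4, ("Q2", "A2"): 0.1,
    ("Q3", "A1"): 0.3, ("Q3", "A2"): 0.6,
}
known_actives = ["A1", "A2"]
test_labels = {"Q1": 1, "Q2": 0, "Q3": 1}  # 1 = active, 0 = inactive

final_scores = {
    q: max_pair_score(q, known_actives, lambda a, b: toy_scores[(a, b)])
    for q in test_labels
}
ap = average_precision([(test_labels[q], final_scores[q]) for q in test_labels])
```

In this toy example both active compounds rank above the inactive one, so the average precision is 1.0. A real comparison would replace the lookup table with scores from a trained pair model.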
Pnew = 1 (set to 1 to represent a small number of new active compounds; Assumption 1)
Pprv = p (fold ratio of known active compounds to Pnew)
Nnew = n (fold ratio of new inactive compounds to Pnew)
Pprv • Pnew = p + µ*
µ*: optional self-pairing of Pnew (= Pnew • (Pnew − 1)/2), added in the present study to give more weight to new active compounds
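The estimate above can be sketched numerically. This is an illustration only: the formula for PP follows the definitions given here, whereas computing NP as Nnew × Pprv and NN as Nnew × Nprv is an assumption about how the pair sets are formed, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    pp: int  # active-active pairs
    np: int  # inactive-active pairs (assumed: Nnew x Pprv)
    nn: int  # inactive-random pairs (assumed: Nnew x Nprv)


def estimate_pair_counts(
    p: int,
    n: int,
    p_new: int = 1,
    random_fold: int = 4,
    self_pair: bool = True,
) -> PairCounts:
    """Estimate chemical pair data sizes from fold ratios.

    p: fold ratio of known actives to new actives (Pprv = p * Pnew)
    n: fold ratio of new inactives to new actives (Nnew = n * Pnew)
    random_fold: random compounds per previous active (Assumption 2: 4)
    """
    p_prv = p * p_new
    n_new = n * p_new
    n_prv = random_fold * p_prv
    # Optional self-pairing term mu* = Pnew * (Pnew - 1) / 2.
    mu = p_new * (p_new - 1) // 2 if self_pair else 0
    return PairCounts(
        pp=p_prv * p_new + mu,
        np=n_new * p_prv,  # assumption, not stated in the text
        nn=n_new * n_prv,  # assumption, not stated in the text
    )


counts = estimate_pair_counts(p=100, n=20)
```

With Pnew = 1, p = 100, and n = 20, this sketch gives 100 PP pairs, 2,000 NP pairs, and 8,000 NN pairs, consistent with the statement that NN is generally the largest set.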

Table S1. Experimental chemical activity data and cross-validation results (AUC of Precision-Recall curve) for each test target protein. The format of this table is identical to that of Table 1, but the model performance was calculated only for the new active and inactive compounds without

* Chemical compounds with POC lower than 20% and higher than 80% are defined as active and inactive compounds, respectively.
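The activity thresholds in the footnote can be expressed as a small labeling function. This is a minimal sketch: the function name is hypothetical, and leaving compounds with POC between 20% and 80% unlabeled, as well as treating the boundaries as strict inequalities, are assumptions not stated in the text.

```python
from typing import Optional


def label_by_poc(
    poc: float,
    active_below: float = 20.0,
    inactive_above: float = 80.0,
) -> Optional[str]:
    """Label a compound from its POC value (in percent).

    Returns "active" for POC < 20, "inactive" for POC > 80, and None for
    intermediate values (assumed here to be left unlabeled).
    """
    if poc < active_below:
        return "active"
    if poc > inactive_above:
        return "inactive"
    return None


labels = [label_by_poc(value) for value in (5.0, 50.0, 95.0)]
```

For the three example values, the resulting labels are "active", None, and "inactive".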

Table S3. Estimation of chemical pair data size.

Table S4. LogP values for the tested compounds. The consensus LogP values for the compounds are calculated using SwissADME as an indicator of cell permeability. All three MEK1 inhibitors and the positive control (PD98059) are within the modest LogP range (-0.5 < LogP < 5).

Table S5. GNINA docking scores for MEKs are shown with the biochemical binding affinity data in Table 3. GNINA provides a CNN (convolutional neural network) affinity score together with the AutoDock Vina docking score. A higher CNN affinity score and a lower AutoDock Vina score represent better binding.

Table S6. The target prediction results for ZINC5814210 from the SwissTargetPrediction server.

Table S7. The target prediction results for ZINC5814210 from the Similarity Ensemble Approach (SEA) server.